Open RStudio to do the practicals. Note that tasks with * are optional.
In this practical, a number of R packages are used. The packages used (with versions that were used to generate the solutions) are:
memisc (version: 0.99.27.3)
MASS (version: 7.3.54)
The solutions were generated with R version 4.1.0 (2021-05-18).
Explore the pre-loaded data set iris.
What is the percentage of flowers in each Species group?
Investigate whether the proportion of the group setosa is different from 0.5. Define the null and alternative hypothesis.
Investigate whether the proportion of the group setosa is smaller than 0.5.
Use the R function percent(…) from the memisc package to obtain the percentages.
Rule of thumb: \(150*0.3333\) and \(150*(1-0.3333)\) are large so we can use the normal approximation.
Use the function prop.test() to investigate whether the sample proportion is different from a particular value.
The prop.test() function presents the X-squared test statistic. To calculate the test statistic manually, use the connection between the X-squared and normal distribution.
Check the argument alternative of the prop.test(…) function.
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
##
## Attaching package: 'memisc'
## The following objects are masked from 'package:stats':
##
## contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
##
## as.array
percent(iris$Species)
## setosa versicolor virginica N
## 33.33333 33.33333 33.33333 150.00000
# obtain the number of subjects in each group
table(iris$Species)
##
## setosa versicolor virginica
## 50 50 50
The proportion of each species is (50/150=) 0.33.
Null hypothesis: the probability of the group setosa is equal to 0.5.
Alternative hypothesis: the probability of the group setosa is different from 0.5.
Rule of thumb: \(150*0.3333\) and \(150*(1-0.3333)\) are large so we can use the normal approximation.
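As a quick numeric check of this rule of thumb (the cutoff of roughly 5 expected counts is a common convention, not stated above):
# Both counts are far above the conventional minimum of about 5,
# so the normal approximation is reasonable
150 * 0.3333
150 * (1 - 0.3333)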
# Without continuity correction
test_res_cat <- prop.test(x = 50, n = 150, p = 0.5, alternative = "two.sided", correct = FALSE)
test_res_cat
##
## 1-sample proportions test without continuity correction
##
## data: 50 out of 150, null probability 0.5
## X-squared = 16.667, df = 1, p-value = 4.456e-05
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.2628876 0.4121024
## sample estimates:
## p
## 0.3333333
# With continuity correction
test_res_cat <- prop.test(x = 50, n = 150, p = 0.5, alternative = "two.sided", correct = TRUE)
test_res_cat
##
## 1-sample proportions test with continuity correction
##
## data: 50 out of 150, null probability 0.5
## X-squared = 16.007, df = 1, p-value = 6.312e-05
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.2598209 0.4155318
## sample estimates:
## p
## 0.3333333
If we assume that the significance level is equal to 0.05, the output indicates that we can reject the null hypothesis that the proportion of the setosa group is equal to 0.5.
The test statistic (without continuity correction) is: \(z=\frac{p-\pi_0}{\sqrt{\frac{\pi_0(1-\pi_0)}{n}}}\)
The test statistic (with continuity correction) is: \(z=\frac{p-\pi_0 + c}{\sqrt{\frac{\pi_0(1-\pi_0)}{n}}}\),
where \(p\) is the sample proportion, \(\pi_0\) the hypothesized proportion, \(n\) the sample size,
and \(c = \frac{1}{2n}\) is the continuity correction (added here, since \(p < \pi_0\)).
Connection between normal and X-squared distribution: If \(X_1, X_2, \dots, X_k\) are independent random variables from the standard normal distribution, then the random variable \(Q = X^2_1+X^2_2+ \dots +X^2_k\) follows the \(\chi^2\) distribution with \(k\) degrees of freedom. The prop.test() function presents the X-squared test statistic. Therefore, we take \(z^2\).
# Without continuity correction
# Calculate the test statistic manually
z <- (0.33333 - 0.5) / sqrt((0.5*0.5)/(150))
z^2
## [1] 16.66733
# With continuity correction
# Calculate the test statistic manually
z <- (0.33333 - 0.5 + 1/300) / sqrt((0.5*0.5)/(150))
z^2
## [1] 16.00732
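As an extra check (not part of the original output), the upper tail of the \(\chi^2_1\) distribution applied to \(z^2\) reproduces the p-values reported by prop.test() above:
# p-value without continuity correction (should match 4.456e-05)
pchisq(16.66733, df = 1, lower.tail = FALSE)
# p-value with continuity correction (should match 6.312e-05)
pchisq(16.00732, df = 1, lower.tail = FALSE)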
prop.test(x = 50, n = 150, p = 0.5, alternative = "less")
##
## 1-sample proportions test with continuity correction
##
## data: 50 out of 150, null probability 0.5
## X-squared = 16.007, df = 1, p-value = 3.156e-05
## alternative hypothesis: true p is less than 0.5
## 95 percent confidence interval:
## 0.0000000 0.4025292
## sample estimates:
## p
## 0.3333333
Explore the pre-loaded data set iris.
Investigate whether the proportion of the group setosa is different from 0.5 using the exact test.
Investigate whether the proportion of the group setosa is smaller than 0.5 using the exact test. Obtain the p-value manually and confirm the result with the function.
Use the function binom.test(…) to investigate whether the sample proportion is different from a particular value using the exact test.
The binomial distribution model deals with finding the probability of success of an event which has only two possible outcomes in a series of experiments. For each possible outcome of the binomial we can obtain the corresponding probability. We can calculate the p-value (without using the binom.test() R function) by considering the probability of seeing an outcome as, or more, extreme.
We can calculate the p-value (without using the binom.test() R function) using the pbinom() or sum(dbinom()) R functions.
binom.test(x = 50, n = 150, p = 0.5, alternative = "two.sided")
##
## Exact binomial test
##
## data: 50 and 150
## number of successes = 50, number of trials = 150, p-value = 5.448e-05
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.2585564 0.4148430
## sample estimates:
## probability of success
## 0.3333333
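Because the null probability is 0.5, the binomial distribution under the null hypothesis is symmetric, so the two-sided exact p-value equals twice the lower-tail probability (an extra check, not shown in the original output):
# P(X <= 50) + P(X >= 100) = 2 * P(X <= 50) by symmetry (should match 5.448e-05)
2 * pbinom(q = 50, size = 150, prob = 0.5)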
For a one-tailed test, \(H_1:\pi<\pi_0\), the p-value is \(\Pr(X = 0) + \dots + \Pr(X = k) = \sum_{i=0}^{k}\Pr(X = i) = \sum_{i=0}^{k} \binom{n}{i} \pi_0^i (1-\pi_0)^{n-i}\),
where \(k=50\), \(n=150\) and \(\pi_0=0.5\):
pbinom(q = 50, size = 150, prob = 0.5)
## [1] 2.723767e-05
# or
sum(dbinom(x = 0:50, size = 150, prob = 0.5))
## [1] 2.723767e-05
test_res3 <- binom.test(x = 50, n = 150, p = 0.5, alternative = "less")
test_res3$p.value
## [1] 2.723767e-05
Let’s assume that we want to investigate whether there is a difference in the probability of lung cancer between males and females. We have 50 out of 100 males and 55 out of 70 females with lung cancer.
Rule of thumb: \(100*50/100\), \(100*(1-50/100)\), \(70*55/70\) and \(70*(1-55/70)\) are large so we can use the normal approximation.
Use the R function prop.test() to investigate whether the proportions of the two groups are different.
To calculate the test statistic manually, use the connection between the X-squared and normal distribution.
Null hypothesis: the probability of lung cancer in males is equal to the probability of lung cancer in females.
Alternative hypothesis: the probability of lung cancer in males is different from the probability of lung cancer in females.
Rule of thumb: \(100*50/100\), \(100*(1-50/100)\), \(70*55/70\) and \(70*(1-55/70)\) are large so we can use the normal approximation.
prop.test(x = c(50, 55), n = c(100, 70), alternative = "two.sided")
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(50, 55) out of c(100, 70)
## X-squared = 13.049, df = 1, p-value = 0.0003034
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.4351281 -0.1363005
## sample estimates:
## prop 1 prop 2
## 0.5000000 0.7857143
If we assume that the significance level is equal to 0.05, the output indicates that we can reject the null hypothesis that the two probabilities are the same.
Test statistic (with continuity correction) of the pooled version:
\(z = \frac{(p_1-p_2) + \frac{F}{2}\big(\frac{1}{n_1} + \frac{1}{n_2}\big)}{\sqrt{ p(1-p) \big(\frac{1}{n_1} + \frac{1}{n_2} \big)}}\),
where \(p_1\) and \(p_2\) are the sample proportions, \(n_1\) and \(n_2\) the sample sizes, \(p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}\) the pooled proportion, and \(F = \pm 1\) gives the sign of the continuity correction (here \(+1\), since \(p_1 < p_2\)).
Connection between normal and X-squared distribution: If \(X_1, X_2, \dots, X_k\) are independent random variables from the standard normal distribution, then the random variable \(Q = X^2_1+X^2_2+ \dots +X^2_k\) follows the \(\chi^2\) distribution with \(k\) degrees of freedom. The prop.test() function presents the X-squared test statistic. Therefore, we take \(z^2\).
# Calculate the test statistic manually
p1 <- 50/100   # sample proportion in group 1
p2 <- 55/70    # sample proportion in group 2
n1 <- 100
n2 <- 70
p <- (n1*p1 + n2*p2)/(n1 + n2)   # pooled proportion

z <- ((p1 - p2) + (1/2)*(1/n1 + 1/n2)) / (sqrt( p*(1-p)*(1/n1 + 1/n2) ) )
z^2
## [1] 13.04926
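As with the one-sample test, the p-value reported by prop.test() can be recovered from \(z^2\) via the upper tail of the \(\chi^2_1\) distribution (an extra check, not in the original output):
# Should match the p-value 0.0003034 reported above
pchisq(z^2, df = 1, lower.tail = FALSE)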
Explore the pre-loaded data set survey (from the package MASS).
Investigate whether Sex and Clap are independent. Define the null and alternative hypothesis.
Use the R function chisq.test() to investigate whether two categorical variables are independent.
Null hypothesis: there is no association between the variables Sex and Clap.
Alternative hypothesis: there is an association between the variables Sex and Clap.
chisq.test(table(survey$Sex, survey$Clap))
##
## Pearson's Chi-squared test
##
## data: table(survey$Sex, survey$Clap)
## X-squared = 0.25373, df = 2, p-value = 0.8809
If we assume that the significance level is equal to 0.05, the output indicates that we cannot reject the null hypothesis that the variables Sex and Clap are independent.
The test statistic (without continuity correction) is:
\(X^2 = \sum_{i=1}^K \frac{(O_i-E_i)^2}{E_i}\),
where \(K\) is the number of cells in the contingency table, \(O_i\) the observed count and \(E_i\) the expected count in cell \(i\).
# Calculate the test statistic manually
# Observe the total number of observations for the rows and columns:
addmargins(table(survey$Sex, survey$Clap))
##
## Left Neither Right Sum
## Female 21 24 73 118
## Male 18 25 74 117
## Sum 39 49 147 235
# Obtain the observed and expected data:
Obs <- c(table(survey$Sex, survey$Clap))
Exp <- c(118*39/235, 117*39/235, 118*49/235, 117*49/235, 118*147/235, 117*147/235)

test_statistic <- sum((Obs - Exp)^2/Exp)
test_statistic
## [1] 0.2537294
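The expected counts that were typed in by hand can also be extracted from the chisq.test() object, and the p-value can be recovered from the manual test statistic with \((2-1)\times(3-1)=2\) degrees of freedom (extra checks, not shown in the original output):
# Expected counts computed by chisq.test() itself (same values as Exp above)
chisq.test(table(survey$Sex, survey$Clap))$expected

# Upper-tail chi-squared probability (should match the p-value 0.8809 above)
pchisq(test_statistic, df = 2, lower.tail = FALSE)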
Explore the pre-loaded data set survey (from the package MASS).
Investigate whether Sex and M.I are independent using the exact test.
Use the R function fisher.test() to investigate whether two categorical variables are independent using the exact test.
fisher.test(table(survey$Sex, survey$M.I))
##
## Fisher's Exact Test for Count Data
##
## data: table(survey$Sex, survey$M.I)
## p-value = 0.7678
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.4770082 1.6534163
## sample estimates:
## odds ratio
## 0.8893934
Explore the pre-loaded data set survey (from the package MASS).
Investigate whether Sex and M.I are independent for the Clap equal to Left group.
We have a small sample size, so it might be a better idea to use the exact test.
Use the R function fisher.test() to investigate whether two categorical variables are independent using the exact test.
# Select only the left group from the Clap variable
survey_leftClap <- survey[survey$Clap == "Left", ]
fisher.test(table(survey_leftClap$Sex, survey_leftClap$M.I))
##
## Fisher's Exact Test for Count Data
##
## data: table(survey_leftClap$Sex, survey_leftClap$M.I)
## p-value = 0.1527
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.04958942 1.48553847
## sample estimates:
## odds ratio
## 0.2899661
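To see why the exact test is preferred in this subgroup, the cell counts can be inspected directly (output not shown here; the counts depend on the missing values in survey):
# The Clap == "Left" subgroup contains relatively few observations,
# which is why Fisher's exact test is used instead of the chi-squared test
addmargins(table(survey_leftClap$Sex, survey_leftClap$M.I))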
© Eleni-Rosalina Andrinopoulou